EmergencyReparentShard: filter lagging candidates from relay-log wait, tolerate partial failures#18707
EmergencyReparentShard: filter lagging candidates from relay-log wait, tolerate partial failures#18707timvaillancourt wants to merge 32 commits into
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #18707 +/- ##
===========================================
+ Coverage 69.67% 76.17% +6.50%
===========================================
Files 1614 18 -1596
Lines 216793 6229 -210564
===========================================
- Hits 151044 4745 -146299
+ Misses 65749 1484 -64265
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Copying some conclusions from an offline discussion with @arthurschreiber:
|
|
The reason, AFAIUI, for the current behavior is to prevent any of the healthy tablets from becoming forever unhealthy/unusable due to the new primary not having binary logs covering/containing the GTIDs that the lagging replica(s) may still need (the new primary may have recently been restored from a backup and have minimal binary logs). Have you already thought about that in this context? |
@mattlord I don't think that has changed, but I would appreciate you double checking my assumption because that is a very important functionality One part of the code that could affect that was actually changed in #18531. Previous to this PR, the code called The TL;DR on that wrapper func: we do the same sort but put now prioritise positions with the most advanced SQL thread if two combined sets are equal. In the end the same And in terms of being certain we're ignoring the right tablets: after |
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
|
@mattlord thanks for the review! I'm leaning towards keeping it together, but let me know if separating the "split-brain" scenario makes this an easier review. Shipping together makes it more clear it will all work together though
Good idea, implemented in ef09389
Sounds good 👍. Implemented in ef09389 |
…-lagging Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
There was a problem hiding this comment.
In go/vt/vtctl/reparentutil/emergency_reparenter.go:438-440 we shouldn't drop failed-ERS replication cleanup before reparenting has actually recovered the replicas. At this point ERS has only applied relay logs; replicas whose IO_THREAD was stopped are still stopped. Clearing replicasToRestart here means later aborts, such as promoteIntermediateSource failure, waitForCatchUp timeout, lost topo lock, or final promotion/repoint failure, skips deferred cleanup and can leave replicas with IO stopped. Please keep cleanup state until affected tablets have actually been repointed/started, or track per-tablet state, and add a regression test that fails after this point.
The text in changelog/25.0/25.0.0/summary.md:140-143 around the split-brain override limitations are stale. The code now restores the leading set when no leading candidate survives errant-GTID detection, so the “lagging tablet can still be picked” warning is no longer accurate. The sorter also now has deterministic promotion-rule/alias tie-breakers, so the map-iteration warning is misleading. Please update the release note to match the current behavior. Actually... I would just remove this from the summary. This is closer to a bug fix and doesn't need to be highlighted in the v25 release summary IMO.
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
|
mattlord
left a comment
There was a problem hiding this comment.
In go/vt/vtctl/reparentutil/emergency_reparenter.go:438-483 the cleanup list now survives into the final promotion/reparent phase, so I think that any error after PromoteReplica/InitPrimary succeeds can trigger the deferred cleanup and call StartReplication on the just-promoted primary. That can then run START REPLICA on a primary, apply replica semisync settings to it, and wrap/mask the real ERS failure. The new regression test covers PromoteReplica failing, but not failures after promotion succeeds, e.g. PopulateReparentJournal or later replica-repointing errors. I think that we should keep cleanup for pre-promotion aborts, but remove/prune the primary-elect from replicasToRestart once promotion succeeds, and add a post-promotion failure regression test.
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
|
@arthurschreiber / @mattlord thanks for reviews! I believe I have address the open concerns, please validate when you have time 🙇 |
mattlord
left a comment
There was a problem hiding this comment.
In go/vt/vtctl/reparentutil/emergency_reparenter.go:202-245, 516-540 the deferred failed-ERS cleanup still runs after PromoteReplica/InitPrimary succeeds and only filters out the promoted tablet. If PopulateReparentJournal or the final SetReplicationSource fanout fails after the new primary is already writable, cleanup calls StartReplication on the other tablets whose IO thread ERS stopped. StartReplication does not repoint them; it resumes their existing source and applies semisync relative to prevPrimary, which can reconnect replicas to the old primary/source after a new primary has been promoted. Please disable this generic pre-promotion cleanup once promotion succeeds, or track per-tablet final-repoint success and only perform a deliberate post-promotion recovery against the new primary.
In go/vt/vtctl/reparentutil/replication.go:286-303 the SQL-stopped cleanup path still does not restore the pre-ERS state. For tablets with IO running and SQL intentionally stopped, ERS stops IO via StopReplicationAndGetStatus(... IOTHREADONLY), but replicasWithStoppedIO now excludes them from cleanup to avoid StartReplication starting SQL. That avoids starting SQL, but leaves IO stopped after a pre-promotion abort. Please restore the exact original thread state, e.g. start only IO for this class, and add a regression that verifies SQL remains stopped while IO is restarted.
|
Thanks for the careful re-read, @mattlord. Splitting these: Issue 1 (post-promotion cleanup): Agreed, will fix in the next push. Approach: skip the deferred Issue 2 (SQL-stopped class): Valid improvement, but I'd like to defer it to a follow-up for scope/review velocity if you're OK with that. Quick context: in your earlier review on this PR (#18707 (review)) you offered two options for the SQL-stopped class — "restore the original thread state exactly, or avoid calling full The exact-state restoration you're now describing (start IO, leave SQL stopped) is the stronger fix, but it needs a new tabletmanager capability — no current RPC issues If you'd rather have it in this PR I'll do it — just wanted to flag the scope first. |
- skip failed-ERS cleanup once promotion has been attempted (mattlord); flip primaryPromoted before InitPrimary/PromoteReplica so a post-side- effect RPC error doesn't run StartReplication against a tablet that may already be writable (codex P1) - findMostAdvanced Safety net #2: reject applied candidates strictly dominated by another eligible applied candidate (codex P1) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…agging Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> # Conflicts: # changelog/25.0/25.0.0/summary.md
|
Update: converting to draft as I separate the GTID-based optimization and the split-brain fix, as discussed, in order to make review more streamlined |
Strip the split-brain-detection feature from this branch so reviewers can evaluate the GTID-based ERS optimization on its own. The split-brain pieces (`AllowSplitBrainPromotion` flag, upfront uniformity check, errant-GTID restoration safeguard, override paths in findMostAdvanced) move to a separate ers-split-brain-detection branch built independently on main. What stays here: - Filter to the leading Combined position before relay-log apply, tolerate partial relay-log-apply failures, and bump applied tablets' Executed to Combined so the sorter prefers them. - replicasWithStoppedIO returns SQL-stopped replicas as a separate slice so cleanup can skip-and-warn instead of silently restarting them. - reparent_sorter handles incomparable GTID positions deterministically. - newPrimary/primaryPromoted deferred-cleanup safeguard. - EmergencyReparentFilteredCandidates and EmergencyReparentRelayLogFailedCandidates stats. AGENTS.md and the changelog still describe both features and reach their full state once the split-brain follow-up also lands. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
The split-brain detection (upfront FAILED_PRECONDITION abort), --allow-split-brain-promotion flag, and EmergencyReparentSplitBrainOverrides stat are not part of this branch's code. They moved to the follow-up ers-split-brain-detection PR. This commit strips the changelog text that described them so the release notes match what's actually delivered here. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…toff Five fixes from successive codex and Claude Opus paranoid reviews: 1. First-wait guard: when filterToLeadingGroup keeps multiple non-dominated candidates (incomparable Combined positions), force requireAll=true. The one-success short-circuit would otherwise let applyRelayLogsAndReconcile delete a failed incomparable leader from validCandidates, bypassing the AtLeast split-brain check in findMostAdvanced. Includes a regression test verified to fail without the guard. 2. Second-wait guard: same uniformCombined check on the errant-GTID re-wait pass, so survivors with incomparable Combined positions are subject to the same requireAll-on-failure semantics. 3. Cleanup cutoff: replicationMutated (renamed from primaryPromoted) now flips before promoteIntermediateSource — once SetReplicationSource has started mutating replicas, the deferred StartReplication cleanup is unsafe whether or not we reach the final PromoteReplica. 4. Deadline-as-cancellation: waitForAllRelayLogsToApply now treats every waiter error after parent ctx.Err()!=nil as expected noise, not just context.Canceled. Parent-ctx DEADLINE_EXCEEDED was previously recorded as a per-tablet failure and could bump EmergencyReparentRelayLogFailedCandidates on operator timeout. 5. validCandidates non-mutation regression test: asserts that on the requireAll-with-failure path, applyRelayLogsAndReconcile returns the error before the reconcile loop runs, leaving validCandidates and its pointed-to RelayLogPositions bit-identical to the input. Guards (1) and (2) depend on this property to be load-bearing. Reverts the in-PR attempt to leave SQL-thread-stopped replicas at their pre-ERS state — deferred to vitessio#20256 which adds the missing StartReplicationMode_IOTHREADONLY primitive across the tabletmanager surface. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…agging Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
When running Vitess in production, one of the most common problems preventing
EmergencyReparentShardoperations from occurring quickly and without error is outlier lagging tablets. Some situations where this occurs:REPLICAs that struggle to keep up in replication - typically due to query loadRDONLYsToday, tablets lagging in these scenarios are likely to cause ERS to take longer and/or timeout 😱
After
StopReplicationAndGetStatuscompletes during ERS, replication is stopped on all tablets and theirCombined(relay log) positions are frozen. Today ERS waits for all tablets to apply relay logs before picking a winner — but a tablet whoseCombinedposition is strictly behind the leading group can never win. Waiting for it is pointless and can only hurt us (by timing out the entire ERS over a tablet that was never a contender)Filtering out known-problem tablets early makes reparents less brittle and faster 🚀
This PR addresses #18529 by making ERS more tolerant of lagging tablets, while preserving the
AGENTS.mdERS contract that we must pick the most-advanced candidate with certaintyNote
The split-brain detection /
--allow-split-brain-promotionfollow-up that originally lived in this PR has moved to #20289 to make review more streamlinedWhat changed
GTID-only relay-log-apply optimisation — for GTID-based replication,
Combinedis the received relay-log position, so we can prove which tablets can never win and filter them out of the wait. Tablets at the leadingCombinedposition are waited on; the first success short-circuits the rest. Lagging tablets stay in the candidate pool (they can still be repointed at the new primary later), they just don't block the wait phase. For non-GTID flavours (FilePos, MariaDB) this optimisation is unsafe —Combinedthere is the executed position and doesn't reflect received-but-not-applied relay logs — so those flavours keep the pre-PRrequireAll=truebehaviour unchangedPairwise-dominance filter —
filterToMostAdvancedCombined()uses pairwise dominance (not "compare to a single max") so it correctly handles partially-ordered GTID sets: when two candidates have disjoint UUIDs neither dominates the other, so both stay in the leading group. Comparing against a single chosen max would silently drop one incomparable maximumErrant-GTID re-wait pass — errant filtering can remove every originally-applied tablet, leaving only "unwaited survivors" (tablets cancelled mid-apply by our own short-circuit, or strictly-behind tablets we excluded from the first wait). Promoting one of those would risk a primary with received-but-unapplied transactions 😱 We detect this case (no applied tablet survives errant detection) and run a second filter → wait pipeline over the survivors before promotion
Implementation details
A few smaller mechanics behind the headline changes:
applyRelayLogsAndReconcile()helper — applied tablets getExecutedbumped toCombinedso the existing sorter prefers them via theCombined→Executed→ promotion-rule tiebreak; failed tablets are removed fromvalidCandidates, cancelled ones are left untouchedstatusMap(current/stuck PRIMARY fromErrNotReplica) has no relay logs to apply, so it's markedtrueinsuccessMapupfront. This bumps itsExecuted = Combinedin reconcile, preventing cancelled-mid-apply peers (non-zero pre-waitExecuted) from sorting ahead of it infindMostAdvancedgroupCancel()or from a cancelled parent context are treated as expected noise (must check botherrors.Is(err, context.Canceled)and gRPC-wrappedvtrpc.Code_CANCELED, since the latter does not satisfy the former). When the parent ctx is cancelled with no real failure to surface, we return the wrappedctx.Err()directly so operators see "aborted while waiting for relay logs to apply"PromoteReplica/InitPrimaryreturns nil on the promotion target, the deferred replication-restart cleanup must not run on that tablet (would callStartReplicationon a now-PRIMARY). AprimaryPromotedflag is plumbed throughreparentReplicasso the cleanup filters the promoted tablet outStats
Two new stats are exported for observability:
EmergencyReparentFilteredCandidates{Keyspace, Shard}— tablets excluded from the relay-log wait because theirCombinedposition is strictly behind the leading groupEmergencyReparentRelayLogFailedCandidates{Keyspace, Shard}— tablets that genuinely failed to apply relay logs (RPC error, MySQL error, or timeout). Cancellations after a peer succeeded — or after parent-ctx cancel — are not counted. The metric is also incremented on the error path before any abort returns, so operators can still see failure counts when ERS aborts for some other reasonTesting
E2E tests in
ers_test.go:TestERSFiltersNonMostAdvancedCandidates— stops the IO thread on one replica, writes data the lagger won't see, kills the primary, runs ERS, then assertsEmergencyReparentFilteredCandidatesincremented via the vtctld/debug/varsendpoint and that the lagging tablet was not chosen as the new primaryTestReplicationStoppedupdated — previously asserted ERS failed with 2 replicas having replication stopped; now asserts ERS succeeds with the third tablet as the surviving candidate, since partial relay-log failures are toleratede2eutils.SkipIfBinaryIsBelowVersion(t, 25, "vtctld")so theReparent Old Vtctlupgrade-downgrade job (which runs against the previous release'svtctld) skips them cleanlyRelated Issue(s)
Resolves: #18529
Checklist
Deployment Notes
This is a user-visible behavioural change in
EmergencyReparentShardand is called out inchangelog/25.0/25.0.0/summary.mdunder VTCtld. No new required flags or configuration:GTID-based shards — ERS now tolerates a minority of replicas failing or hanging during the relay-log-apply wait. As long as at least one tablet at the leading
Combinedposition applies successfully (or a no-status PRIMARY-like tablet is present at that position), ERS proceeds. The pre-PR behaviour (any single failure aborts ERS) was unnecessarily fragileNon-GTID shards (FilePos, MariaDB) — behaviour is unchanged. The optimisation is gated on
isGTIDBasedand these flavours still wait for every candidateTwo new stats (
EmergencyReparentFilteredCandidates,EmergencyReparentRelayLogFailedCandidates) are exported from vtctld for operators to track filter and failure counts per keyspace/shardAI Disclosure
Claude Code assisted with development, testing and this PR summary